USHIFT-6808: C2CC Chaos Tests (Restarting Services, toggling ethernet)#6930
USHIFT-6808: C2CC Chaos Tests (Restarting Services, toggling ethernet)#6930pmtk wants to merge 5 commits into
Conversation
|
@pmtk: This pull request references USHIFT-6808 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughAdds a C2CC chaos resilience test suite ( ChangesC2CC Chaos Suite and Shared Resource Extensions
Sequence Diagram(s)sequenceDiagram
participant ChaosTest as Chaos Test Case
participant c2cc.resource as c2cc.resource Keywords
participant Cluster as Target Cluster VM
participant Observers as Observer Clusters
rect rgba(70, 130, 180, 0.5)
note over ChaosTest,Observers: Pre-disruption gate
ChaosTest->>c2cc.resource: Ensure All Clusters Healthy
c2cc.resource->>Cluster: Wait Until Keyword Succeeds — Verify RemoteCluster State (Healthy)
end
rect rgba(180, 80, 80, 0.5)
note over ChaosTest,Cluster: Fault injection
ChaosTest->>Cluster: Inject disruption (restart service / delete pod / Disable All NICs For VM)
ChaosTest->>Observers: Verify RemoteCluster Unhealthy On Observers
end
rect rgba(60, 160, 60, 0.5)
note over ChaosTest,Observers: Recovery verification
ChaosTest->>c2cc.resource: Wait Until Keyword Succeeds — Ensure All Clusters Healthy
ChaosTest->>c2cc.resource: Verify Full Recovery On Clusters
c2cc.resource->>Cluster: Verify C2CC Infrastructure On Cluster
c2cc.resource->>Cluster: Verify Cross Cluster Connectivity
c2cc.resource->>Cluster: Verify Cross Cluster DNS
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 12 | ❌ 3❌ Failed checks (3 warnings)
✅ Passed checks (12 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: pmtk The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/bin/c2cc_common.sh`:
- Around line 186-193: The `setup_hosts` logic is handling `get_host_ip`
failures with early returns, but `full_vm_name` for `host2_vm` and `host3_vm` is
not checked the same way. Update the assignments around `full_vm_name`,
`host2_vm`, and `host3_vm` so a failure aborts immediately instead of
propagating empty `HOST2_VM_NAME`/`HOST3_VM_NAME` values into the NIC chaos
keywords. Keep the error handling consistent with the existing `get_host_ip`
pattern by returning from the function on lookup failure before the `readonly`
exports.
In `@test/resources/c2cc.resource`:
- Around line 562-563: The deregistration cleanup currently always calls
SSHLibrary.Close Connection even when SSHLibrary.Switch Connection fails, which
can close the wrong active SSH session on повторное deregistration. Update the
teardown logic in the affected resource so Close Connection only runs when the
switch to ${alias} succeeds, using the existing SSHLibrary.Switch Connection and
SSHLibrary.Close Connection keywords together with a conditional or status check
to preserve the current session when the alias is missing.
- Around line 541-549: In Disable All NICs For VM, add a fail-fast guard after
Get Vnet Devices For MicroShift Host returns ${vnet_ifaces} so the test errors
immediately when no vnet devices are found instead of silently doing nothing.
Check for an empty list before the FOR loop, and raise a clear failure that
includes ${vm_name} and the lack of discovered NICs so the issue is obvious in
the test output.
- Around line 483-488: The precondition gate in Ensure All Clusters Healthy only
verifies RemoteCluster CR status and can let a degraded local MicroShift/OVN-K
environment pass. Update this keyword to run the actual cluster healthcheck
before fault injection, alongside or instead of Verify RemoteCluster State, so
the setup blocks unless the real cluster health is healthy for each alias
(cluster-a, cluster-b, cluster-c).
- Around line 170-184: The IP rule checks in Verify Service IP Rules and Verify
All IP Rules only validate the rule targets, so they can miss wrong ordering.
Update these Robot Framework keywords to also assert the expected priority value
for each rule returned by Command On Cluster, using the existing `${stdout}`
checks with Should Contain against the full rule strings so table 200/201
entries are verified at the correct priority.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 9ab20d10-7e3c-468d-a472-1cea8e83ce11
📒 Files selected for processing (6)
test/bin/c2cc_common.shtest/resources/c2cc.resourcetest/scenarios-bootc/c2cc/el98-src@c2cc-chaos.shtest/suites/c2cc/chaos.robottest/suites/c2cc/cleanup.robottest/suites/c2cc/probe.robot
💤 Files with no reviewable changes (2)
- test/suites/c2cc/cleanup.robot
- test/suites/c2cc/probe.robot
- c2cc_common.sh: add || return 1 to full_vm_name calls for consistency - c2cc.resource: fail fast when no vnet interfaces found for VM - c2cc.resource: only close SSH connection after successful switch Co-Authored-By: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- c2cc_common.sh: add || return 1 to full_vm_name calls for consistency - c2cc.resource: fail fast when no vnet interfaces found for VM - c2cc.resource: only close SSH connection after successful switch
30371f0 to
0d99900
Compare
|
/test ? |
|
/test e2e-aws-tests-bootc-c2cc e2e-aws-tests-bootc-c2cc-arm |
MicroShift, NetworkManager, OVN-K restarts. NIC disabling.
- c2cc_common.sh: add || return 1 to full_vm_name calls for consistency - c2cc.resource: fail fast when no vnet interfaces found for VM - c2cc.resource: only close SSH connection after successful switch
0d99900 to
bcfed52
Compare
|
/test e2e-aws-tests-bootc-c2cc e2e-aws-tests-bootc-c2cc-arm |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/suites/c2cc/chaos.robot`:
- Line 55: The cluster command in the chaos test is too disruptive for Command
On Cluster because restarting NetworkManager can drop the SSH session before the
call returns. Update the test in chaos.robot to use Disruptive Command On
Cluster for the systemctl restart NetworkManager step, using the existing
cluster-c command block so the session loss is handled safely.
- Around line 64-76: The disabled-state flag is set too late in the chaos test
flow, so a failure inside Disable All NICs For VM can leave NICs down while
Restore NICs And Reconnect sees no disabled VM to clean up. In the c2cc chaos
scenario, set ${DISABLED_VM} before calling Disable All NICs For VM and keep the
existing reset after recovery; use the existing keywords Disable All NICs For
VM, Restore NICs And Reconnect, and ${DISABLED_VM} to locate the change.
- Around line 149-157: The setup flow is missing validation for VM name
environment variables, which lets NIC-outage tests fail later with unclear
errors. Update Check Required Env Variables to also require HOST2_VM_NAME and
HOST3_VM_NAME alongside the existing host/IP/port/kubeconfig checks, and ensure
the Setup sequence still calls that validation before Register Remote Cluster so
failures surface early with a clear message.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 8b370a02-f8f1-4408-b570-8eb1581bfc0e
📒 Files selected for processing (5)
test/bin/c2cc_common.shtest/resources/c2cc.resourcetest/scenarios-bootc/c2cc/el98-src@c2cc-chaos.shtest/suites/c2cc/chaos.robottest/suites/c2cc/probe.robot
💤 Files with no reviewable changes (1)
- test/suites/c2cc/probe.robot
🚧 Files skipped from review as they are similar to previous changes (2)
- test/bin/c2cc_common.sh
- test/resources/c2cc.resource
| [Documentation] Restart NetworkManager on cluster-c. | ||
| ... Verify kernel routes/rules are restored and connectivity recovers. | ||
| [Setup] Ensure All Clusters Healthy | ||
| Command On Cluster cluster-c systemctl restart NetworkManager |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -nP -C5 '^(Command On Cluster|Disruptive Command On Cluster)\b' test/resources/c2cc.resourceRepository: openshift/microshift
Length of output: 1266
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== test/suites/c2cc/chaos.robot =="
wc -l test/suites/c2cc/chaos.robot
sed -n '1,180p' test/suites/c2cc/chaos.robot
echo
echo "== test/resources/c2cc.resource =="
wc -l test/resources/c2cc.resource
sed -n '1,180p' test/resources/c2cc.resourceRepository: openshift/microshift
Length of output: 16640
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== test/suites/c2cc/chaos.robot =="
wc -l test/suites/c2cc/chaos.robot
sed -n '1,180p' test/suites/c2cc/chaos.robot
echo
echo "== test/resources/c2cc.resource =="
wc -l test/resources/c2cc.resource
sed -n '1,180p' test/resources/c2cc.resourceRepository: openshift/microshift
Length of output: 16640
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== test/suites/c2cc/chaos.robot =="
wc -l test/suites/c2cc/chaos.robot
sed -n '1,180p' test/suites/c2cc/chaos.robot
echo
echo "== test/resources/c2cc.resource =="
wc -l test/resources/c2cc.resource
sed -n '1,180p' test/resources/c2cc.resourceRepository: openshift/microshift
Length of output: 16640
Switch this to Disruptive Command On Cluster systemctl restart NetworkManager can tear down the SSH session before Command On Cluster returns, so this test can fail intermittently.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/suites/c2cc/chaos.robot` at line 55, The cluster command in the chaos
test is too disruptive for Command On Cluster because restarting NetworkManager
can drop the SSH session before the call returns. Update the test in chaos.robot
to use Disruptive Command On Cluster for the systemctl restart NetworkManager
step, using the existing cluster-c command block so the session loss is handled
safely.
| ${vnet_ifaces} Disable All NICs For VM ${HOST2_VM_NAME} | ||
| VAR ${DISABLED_VM} ${HOST2_VM_NAME} scope=TEST | ||
| VAR @{DISABLED_IFACES} @{vnet_ifaces} scope=TEST | ||
| Verify RemoteCluster Unhealthy On Observers ${HOST2_IP} cluster-a cluster-c | ||
| Enable All NICs For VM ${HOST2_VM_NAME} ${vnet_ifaces} | ||
| Reconnect To Cluster cluster-b ${HOST2_IP} ${HOST2_SSH_PORT} ${KUBECONFIG_B} | ||
| ... timeout=${RECOVERY_TIMEOUT} | ||
| VAR ${DISABLED_VM} ${EMPTY} scope=TEST | ||
| Wait Until Keyword Succeeds ${RECOVERY_TIMEOUT} ${RECOVERY_RETRY} | ||
| ... Verify Cluster Is Healthy cluster-b | ||
| Verify Full Recovery On Clusters cluster-b | ||
| [Teardown] Restore NICs And Reconnect | ||
| ... ${HOST2_VM_NAME} cluster-b ${HOST2_IP} ${HOST2_SSH_PORT} ${KUBECONFIG_B} |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -nP -C8 '^Disable All NICs For VM\b' test/resources/c2cc.resourceRepository: openshift/microshift
Length of output: 1022
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '541,575p' test/resources/c2cc.resource
printf '\n---\n'
rg -n "DISABLED_VM|Restore NICs And Reconnect|Disable NICs|Reconnect To Cluster" test/suites/c2cc/chaos.robot test/resources/c2cc.resourceRepository: openshift/microshift
Length of output: 2965
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '1,140p' test/suites/c2cc/chaos.robotRepository: openshift/microshift
Length of output: 7246
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '1,220p' test/suites/c2cc/chaos.robot
printf '\n--- resource snippet ---\n'
sed -n '541,620p' test/resources/c2cc.resourceRepository: openshift/microshift
Length of output: 10455
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n -C6 "Restore NICs And Reconnect|DISABLED_VM" test/resources/c2cc.resource test/suites/c2cc/chaos.robotRepository: openshift/microshift
Length of output: 6798
🏁 Script executed:
#!/bin/bash
set -euo pipefail
grep -nR "DISABLED_VM" -n test/resources/c2cc.resource test/suites/c2cc/chaos.robotRepository: openshift/microshift
Length of output: 646
🏁 Script executed:
#!/bin/bash
set -euo pipefail
wc -l test/resources/c2cc.resource
sed -n '620,720p' test/resources/c2cc.resourceRepository: openshift/microshift
Length of output: 191
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n -C8 "Restore NICs And Reconnect|Disable All NICs For VM|DISABLED_VM|DISABLED_IFACES" testRepository: openshift/microshift
Length of output: 9865
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python3 - <<'PY'
from pathlib import Path
p = Path("test/resources/c2cc.resource")
for i, line in enumerate(p.read_text().splitlines(), 1):
if "Restore NICs And Reconnect" in line or "DISABLED_VM" in line or "DISABLED_IFACES" in line:
start = max(1, i-4)
end = min(i+12, sum(1 for _ in p.open()))
print(f"--- around line {i} ---")
for j, l in enumerate(p.read_text().splitlines()[start-1:end], start):
print(f"{j}: {l}")
PYRepository: openshift/microshift
Length of output: 158
Move the disabled-state flag before the NIC-disable call. In test/suites/c2cc/chaos.robot, Disable All NICs For VM can fail after some NICs are already down, but ${DISABLED_VM} stays empty until the call returns, so Restore NICs And Reconnect skips cleanup on that path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/suites/c2cc/chaos.robot` around lines 64 - 76, The disabled-state flag
is set too late in the chaos test flow, so a failure inside Disable All NICs For
VM can leave NICs down while Restore NICs And Reconnect sees no disabled VM to
clean up. In the c2cc chaos scenario, set ${DISABLED_VM} before calling Disable
All NICs For VM and keep the existing reset after recovery; use the existing
keywords Disable All NICs For VM, Restore NICs And Reconnect, and ${DISABLED_VM}
to locate the change.
| Setup | ||
| [Documentation] Set up clusters and deploy test workloads. | ||
| Check Required Env Variables | ||
| Login MicroShift Host | ||
| Setup Kubeconfig | ||
| Register Local Cluster cluster-a | ||
| Register Remote Cluster cluster-b ${HOST2_IP} ${HOST2_SSH_PORT} ${KUBECONFIG_B} | ||
| Register Remote Cluster cluster-c ${HOST3_IP} ${HOST3_SSH_PORT} ${KUBECONFIG_C} | ||
| Deploy Test Workloads |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -nP -C10 '^Check Required Env Variables\b' test/resources/c2cc.resource
rg -nP 'HOST2_VM_NAME|HOST3_VM_NAME' test/bin/c2cc_common.shRepository: openshift/microshift
Length of output: 264
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '--- test/resources/c2cc.resource: Check Required Env Variables ---'
rg -n -C 20 '^Check Required Env Variables\b|HOST2_VM_NAME|HOST3_VM_NAME|VM_NAME' test/resources/c2cc.resource
echo
echo '--- test/bin/c2cc_common.sh around HOST2/3 VM vars ---'
rg -n -C 20 'HOST2_VM_NAME|HOST3_VM_NAME|host2_vm|host3_vm|KUBECONFIG_B|KUBECONFIG_C' test/bin/c2cc_common.shRepository: openshift/microshift
Length of output: 225
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '1,260p' test/resources/c2cc.resource | nl -ba | sed -n '1,260p'Repository: openshift/microshift
Length of output: 198
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '--- files of interest ---'
git ls-files 'test/resources/*' 'test/bin/*' | sed -n '1,200p'
echo
echo '--- search for Check Required Env Variables across repo ---'
rg -n -C 4 'Check Required Env Variables' test
echo
echo '--- search for HOST2_VM_NAME / HOST3_VM_NAME across repo ---'
rg -n -C 4 'HOST2_VM_NAME|HOST3_VM_NAME' testRepository: openshift/microshift
Length of output: 44477
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Read the relevant resource file with line numbers, without relying on external tools like nl.
python3 - <<'PY'
from pathlib import Path
p = Path('test/resources/c2cc.resource')
text = p.read_text().splitlines()
for i, line in enumerate(text, 1):
if 'Check Required Env Variables' in line or 'HOST2_VM_NAME' in line or 'HOST3_VM_NAME' in line:
start = max(1, i-20)
end = min(len(text), i+40)
for j in range(start, end+1):
print(f"{j:4d}: {text[j-1]}")
break
PYRepository: openshift/microshift
Length of output: 158
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '--- matches for the keyword ---'
rg -n -C 6 '^Check Required Env Variables\b|Check Required Env Variables' test
echo
echo '--- c2cc resource files ---'
git ls-files 'test/resources/*c2cc*' 'test/resources/*.resource' 'test/bin/*c2cc*' | sed -n '1,200p'
echo
echo '--- c2cc.resource excerpts around VM-name vars if present ---'
python3 - <<'PY'
from pathlib import Path
p = Path('test/resources/c2cc.resource')
if p.exists():
text = p.read_text().splitlines()
for i, line in enumerate(text, 1):
if 'HOST2_VM_NAME' in line or 'HOST3_VM_NAME' in line or 'Check Required Env Variables' in line:
start = max(1, i-25)
end = min(len(text), i+35)
for j in range(start, end+1):
print(f"{j:4d}: {text[j-1]}")
print('---')
PYRepository: openshift/microshift
Length of output: 50376
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find where the setup keyword actually lives and how the VM-name vars are defined.
rg -n -C 8 'Check Required Env Variables|HOST2_VM_NAME|HOST3_VM_NAME' test/suites test/resources test/binRepository: openshift/microshift
Length of output: 50376
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '--- all occurrences of HOST2_VM_NAME / HOST3_VM_NAME ---'
rg -n -C 3 'HOST2_VM_NAME|HOST3_VM_NAME' .
echo
echo '--- all occurrences of Check Required Env Variables ---'
rg -n -C 12 '^Check Required Env Variables\b' .
echo
echo '--- likely c2cc setup file around variable definitions ---'
python3 - <<'PY'
from pathlib import Path
for path in [Path('test/bin/c2cc_common.sh'), Path('test/resources/c2cc.resource')]:
if path.exists():
print(f'\n### {path}')
text = path.read_text().splitlines()
for i, line in enumerate(text, 1):
if 'HOST2_VM_NAME' in line or 'HOST3_VM_NAME' in line or 'Check Required Env Variables' in line:
start = max(1, i-20)
end = min(len(text), i+30)
for j in range(start, end+1):
print(f"{j:4d}: {text[j-1]}")
print('---')
PYRepository: openshift/microshift
Length of output: 12016
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python3 - <<'PY'
from pathlib import Path
candidates = [p for p in Path('test').rglob('*') if p.is_file() and ('c2cc' in p.name or 'c2cc' in str(p))]
for p in sorted(candidates):
print(p)
PYRepository: openshift/microshift
Length of output: 1195
Add VM-name checks to Check Required Env Variables. HOST2_VM_NAME and HOST3_VM_NAME default to ${EMPTY} but aren’t validated there, so the NIC-outage cases fail later with a vague error.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/suites/c2cc/chaos.robot` around lines 149 - 157, The setup flow is
missing validation for VM name environment variables, which lets NIC-outage
tests fail later with unclear errors. Update Check Required Env Variables to
also require HOST2_VM_NAME and HOST3_VM_NAME alongside the existing
host/IP/port/kubeconfig checks, and ensure the Setup sequence still calls that
validation before Register Remote Cluster so failures surface early with a clear
message.
|
@pmtk: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary by CodeRabbit
New Features
Bug Fixes